Add DecomposeAggregate optimizer to rewrite AVG as SUM/COUNT#21613
Add DecomposeAggregate optimizer to rewrite AVG as SUM/COUNT#21613Dandandan wants to merge 9 commits intoapache:mainfrom
Conversation
|
run benchmarks |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing decompose-avg-optimization (faea6b4) to 29c5dd5 (merge-base) diff using: clickbench_partitioned File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing decompose-avg-optimization (faea6b4) to 29c5dd5 (merge-base) diff using: tpcds File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing decompose-avg-optimization (faea6b4) to 29c5dd5 (merge-base) diff using: tpch File an issue against this benchmark runner |
Rewrites AVG(x) into CAST(SUM(x) AS Float64) / CAST(COUNT(*) AS Float64) to reduce accumulator overhead. The AVG accumulator stores both sum and count per group internally; splitting into separate SUM and COUNT accumulators is more efficient and enables sharing with existing COUNT(*) aggregates in the same query via CommonSubexprEliminate. Uses COUNT(*) when the AVG argument is non-nullable (the common case for integer/float columns), falling back to COUNT(x) for nullable arguments to preserve correct NULL semantics. Only applies to AVG with Float64 return type; skips DISTINCT, filtered, ordered, and decimal/duration AVGs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ROLLUP/CUBE/GROUPING SETS expand group_expr into more schema fields than group_expr.len(), which breaks the index arithmetic used to locate aggregate field types. Skip these cases to avoid FieldNotFound errors in downstream optimizer rules like optimize_projections. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
bbe9058 to
4b81d40
Compare
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usageclickbench_partitioned — base (merge-base)
clickbench_partitioned — branch
File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagetpcds — base (merge-base)
tpcds — branch
File an issue against this benchmark runner |
New test from upstream that includes AVG — update expected EXPLAIN output to reflect the AVG decomposition. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
run benchmark tpch |
|
run benchmarks |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing decompose-avg-optimization (b32f34d) to 6db3899 (merge-base) diff using: tpcds File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing decompose-avg-optimization (b32f34d) to 6db3899 (merge-base) diff using: tpch File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing decompose-avg-optimization (b32f34d) to 6db3899 (merge-base) diff using: clickbench_partitioned File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing decompose-avg-optimization (b32f34d) to 6db3899 (merge-base) diff using: tpch File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usageclickbench_partitioned — base (merge-base)
clickbench_partitioned — branch
File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagetpcds — base (merge-base)
tpcds — branch
File an issue against this benchmark runner |
|
run benchmark tpch tpch10 |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing decompose-avg-optimization (b32f34d) to 6db3899 (merge-base) diff using: tpch File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing decompose-avg-optimization (b32f34d) to 6db3899 (merge-base) diff using: tpch10 File an issue against this benchmark runner |
|
run benchmark clickbench_partitioned |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing decompose-avg-optimization (b32f34d) to 6db3899 (merge-base) diff using: clickbench_partitioned File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagetpch10 — base (merge-base)
tpch10 — branch
File an issue against this benchmark runner |
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usageclickbench_partitioned — base (merge-base)
clickbench_partitioned — branch
File an issue against this benchmark runner |
Instead of SUM(CAST(x AS Float64)), cast to SUM's native types: Int8/Int16/Int32 → CAST(x AS Int64), UInt8/UInt16/UInt32 → CAST(x AS UInt64), Int64/UInt64/Float64 → no cast. This avoids Float64 arithmetic overhead during aggregation and produces more precise results for large integers. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instead of manually mapping each integer type to SUM's native type, just strip the CAST(x AS Float64) added by AVG's type coercion and let SUM apply its own coercion. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove cast-stripping logic from the optimizer rule and pass AVG's args (with their existing Float64 cast) directly to SUM. Add coerce_types to SUM so it handles small integer/float types via the normal planning path. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Avoid precision loss for Int64/UInt64 by stripping the Float64 cast only when SUM can handle the inner type directly. For small integer types, keep the cast so SUM operates on Float64. Update clickbench EXPLAIN expectations to match the new plans. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
alamb
left a comment
There was a problem hiding this comment.
the idea is pretty intersting -- it is like some sort of CSE that can look into the defintiions of aggregates . We could do something similar for VAR maybe too 🤔
I think if we went with this aproach we should avoid adding an entirely new optimizer rile as each slows down planning. Maybe we can instead extend an existing rule
Do we know why this is, BTW? Would it be better to make Average faster somehow? |
The largest improvement comes from having both Also sum and count compile to better code as this has a clear translation to (simd) instructions whereas count+sum in a single lkop probably does (still pretty fast, but...) generate less efficient code. Perhaps two separate loops (one for counts / one for sums) would improve this, but the CSE optimization (reducing state) is the larger improvement. |
Yes I agree, let me think about it a bit more. |
Which issue does this close?
Closes: #19637
Summary
DecomposeAggregateoptimizer rule that rewritesAVG(x)intoCAST(SUM(x) AS Float64) / CAST(COUNT(*) AS Float64)COUNT(*)for non-nullable args (enables sharing with existingCOUNT(*)via CSE), falls back toCOUNT(x)for nullable argsTest plan
cargo fmtandcargo clippyclean🤖 Generated with Claude Code